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SPIKE AND SLAB VARIABLE SELECTION: FREQUENTIST AND 
BAYESIAN STRATEGIES 

By Hemant Ishwaran 1 and J. Sunil Rao 2 
Cleveland Clinic Foundation and Case Western Reserve University 

Variable selection in the linear regression model takes many ap- 
parent faces from both frequentist and Bayesian standpoints. In this 
paper we introduce a variable selection method referred to as a rescaled 
spike and slab model. We study the importance of prior hierarchical 
specifications and draw connections to frequentist generalized ridge 
regression estimation. Specifically, we study the usefulness of con- 
tinuous bimodal priors to model hypervariance parameters, and the 
effect scaling has on the posterior mean through its relationship to 
penalization. Several model selection strategies, some frequentist and 
some Bayesian in nature, are developed and studied theoretically. 
We demonstrate the importance of selective shrinkage for effective 
variable selection in terms of risk misclassification, and show this is 
achieved using the posterior from a rescaled spike and slab model. We 
also show how to verify a procedure's ability to reduce model uncer- 
tainty in finite samples using a specialized forward selection strategy. 
Using this tool, we illustrate the effectiveness of rescaled spike and 
slab models in reducing model uncertainty. 

1. Introduction. We consider the long-standing problem of selecting vari- 
ables in a linear regression model. That is, given n independent responses 
Yi, with corresponding if-dimensional covariates x, = (a^i, . . . , Xi x) , the 
problem is to find the subset of nonzero covariate parameters from (3 = 
(/?!,..., 13k) 1 i where the model is assumed to be 

(1) Yi = a + f3ix iA -\ \- (3 K x it K + £i = ao + x t i (3 + E i , i = l,...,n. 
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The Si are independent random variables (but not necessarily identically 
distributed) such that E(si) = and E(ef) = a 2 . The variance o 1 > is 
assumed to be unknown. 

The true value for f3 will be denoted by j3 = ( flip, . . . , fix,®) 1 and the true 
variance of by Uq > 0. The complexity, or true dimension, is the number 
of Pk,o coefficients that are nonzero, which we denote by fco- We assume that 
1 <ko < K, where K, the total number of covariates, is a fixed value. For 
convenience, and without loss of generality, we assume that covariates have 
been centered and rescaled so that Ya=i x i,k = and Ya=i x i k = n f° r eacFi 
k = 1, . .. ,K. Because we can define Qo = Y, the mean of the Yj responses, 
and replace Y{ by the centered values Y{ — Y , we can simply assume that 
ao = 0. Thus, we remove ao throughout our discussion. 

The classical variable selection framework involves identification of the 
nonzero elements of (3 and sometimes, additionally, estimation of k$. Information- 
theoretic approaches [see, e.g., Shao (1997)] consider all 2 K models and se- 
lect the model with the best fit according to some information based crite- 
ria. These have been shown to have optimal asymptotic properties, but finite 
sample performance has suffered [Bickel and Zhang (1992), Rao (1999), Shao 
and Rao (2000) and Leeb and Potscher (2003)]. Furthermore, such methods 
become computationally infeasible even for relatively small K. Some solu- 
tions have been proposed [see, e.g., Zheng and Loh (1995, 1997)] where a 
data-based ordering of the elements of /3 is used in tandem with a com- 
plexity recovery criterion. Unfortunately, the asymptotic rates that need to 
be satisfied serve only as a guide and can prove difficult to implement in 
practice. 

Bayesian spike and slab approaches to variable selection (see Section 2) 
have also been proposed [Mitchell and Beauchamp (1988), George and Mc- 
Culloch (1993), Chipman (1996), Clyde, DeSimone and Parmigiani (1996), 
Geweke (1996) and Kuo and Mallick (1998)]. These involve designing a hi- 
erarchy of priors over the parameter and model spaces of (1). Gibbs sam- 
pling is used to identify promising models with high posterior probability 
of occurrence. The choice of priors is often tricky, although empirical Bayes 
approaches can be used to deal with this issue [Chipman, George and Mc- 
Culloch (2001)]. With increasing K, however, the task becomes more dif- 
ficult. Furthermore, Barbieri and Berger (2004) have shown that in many 
circumstances the high frequency model is not the optimal predictive model 
and that the median model (the model consisting of those variables which 
have overall posterior inclusion probability greater than or equal to 50%) is 
predictively optimal. 

In recent work, Ishwaran and Rao (2000, 2003, 2005) used a modified 
rescaled spike and slab model that makes use of continuous bimodal priors 
for hypervariance parameters (see Section 3). This method proved particu- 
larly suitable for regression settings with very large K. Applications of this 
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work included identifying differentially expressing genes from DNA microar- 
ray data. It was shown that this could be cast as a special case of (1) under 
a near orthogonal design for two group problems [Ishwaran and Rao (2003)] , 
and as an orthogonal design for general multiclass problems [Ishwaran and 
Rao (2005)]. Along the lines of Barbieri and Berger (2004), attention was 
focused on processing posterior information for (3 (in this case by consider- 
ing posterior mean values) rather than finding high frequency models. This 
is because in high-dimensional situations it is common for there to be no 
high frequency model (in the microarray examples considered K was on 
the order of 60,000). Improved performance was observed over traditional 
methods and attributed to the procedure's ability to maintain a balance 
between low false detection and high statistical power. A partial theoretical 
analysis was carried out and connections to frequentist shrinkage made. The 
improved performance was linked to selective shrinkage in which only truly 
zero coefficients were shrunk toward zero from their ordinary least squares 
(OLS) estimates. In addition, a novel shrinkage plot which allowed adaptive 
calibration of significance levels to account for multiple testing under the 
large K setup was developed. 

1.1. Statement of main results. In this article we provide a general anal- 
ysis of the spike and slab approach. A key ingredient to our approach involves 
drawing upon connections between the posterior mean, the foundation of 
our variable selection approach, and frequentist generalized ridge regression 
estimation. Our primary findings are summarized as follows: 

1. The use of a spike and slab model with a continuous bimodal prior for 
hypervariances has distinct advantages in terms of calibration. However, 
like any prior, its effect becomes swamped by the likelihood as the sample 
size n increases, thus reducing the potential for the prior to impact model 
selection relative to a frequentist method. Instead, we introduce a rescaled 
spike and slab model defined by replacing the ^-responses with y/n- 
rescaled values. This makes it possible for the prior to have a nonvanishing 
effect, and so is a type of sample size universality for the prior. 

2. This rescaling is accompanied by a variance inflation parameter A n . It 
is shown through the connection to generalized ridge regression that A n 
controls the amount of shrinkage the posterior mean exhibits relative to 
the OLS, and thus can be viewed as a penalization effect. Theorem 2 of 
Section 3 shows that if A n satisfies A n — > oo and A n /n — > 0, then the effect 
of shrinkage vanishes asymptotically and the posterior mean (after suit- 
able rescaling) is asymptotically equivalent to the OLS (and, therefore, 
is consistent for (3 ). 

3. While consistency is important from an estimation perspective, we show 
for model selection purposes that the most interesting case occurs when 
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X n = n. At this level of penalization, at least for orthogonal designs, the 
posterior mean achieves an oracle risk misclassification performance rel- 
ative to the OLS under a correctly chosen value for the hypervariance 
(Theorem 5 of Section 5). While this is an oracle result, we show that 
similar risk performance is achieved using a continuous bimodal prior. 
Continuity of the prior will be shown to be essential for the posterior 
mean to identify nonzero coefficients, while bimodality of the prior will 
enable the posterior mean to identify zero coefficients (Theorem 6 of Sec- 
tion 5). 

4. Thus, the use of a rescaled spike and slab model, in combination with 
a continuous bimodal prior, has the effect of turning the posterior mean 
into a highly effective Bayesian test statistic. Unlike the analogous fre- 
quentist test statistic based on the OLS, the posterior mean takes advan- 
tage of model averaging and the benefits of shrinkage through generalized 
ridge regression estimation. This leads to a type of "selective shrinkage" 
where the posterior mean is asymptotically biased and shrunk toward 
zero for coefficients that are zero (see Theorem 6 for an explicit finite 
sample description of the posterior). The exact nature of performance 
gains compared to standard OLS model selection procedures has to do 
primarily with this selective shrinkage. 

5. Information from the posterior could be used in many ways to select vari- 
ables; however, by using a local asymptotic argument, we show that the 
posterior is asymptotically maximized by the posterior mean (see Section 
4). This naturally suggests the use of the posterior mean, especially when 
combined with a reliable thresholding rule. Such a rule, termed "Zcut" , is 
motivated by a ridge distribution that appears in the limit in our analysis. 
Also suggested from this analysis is a new multivariate null distribution 
for testing if a coefficient is zero (Section 5). 

6. We introduce a forward stepwise selection strategy as an empirical tool 
for verifying the ability of a model averaging procedure to reduce model 
uncertainty. If a procedure is effective, then its data based version of the 
forward stepwise procedure should outperform an OLS model estimator. 
See Section 6 and Theorem 8. 

1.2. Selective shrinkage. A common thread underlying the article, and 
key to most of the results just highlighted, is the selective shrinkage ability 
of the posterior. It is worthwhile, therefore, to briefly amplify what we mean 
by this. Figure 1 serves as an illustration of the idea. There Z-test statistics 
Zk^n estimated by OLS under the full model are plotted against the corre- 
sponding posterior mean values (3% under our rescaled spike and slab model 
(the notation used will be explained later in the paper). As mentioned, these 
rescaled models are derived under a -y/n-rescaling of the data, which forces 
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Fig. 1. Selective shrinkage. Z-test statistics Z^^ n versus posterior mean values f)£ „ (blue 
circles are zero coefficients, red triangles nonzero coefficients). Result from Breiman sim- 
ulation of Section 8 with an uncorrelated design matrix, ko = 105, K = 400 and n — 800. 

the posterior mean onto a y'n-scale. This is why we plot the posterior mean 
against a test statistic. The results depicted in Figure 1 are based on a sim- 
ulation, as in Breiman (1992), for an uncorrelated (near-orthogonal) design 
where ko = 105, K = 400 and n = 800 (see Section 8 for details). Selective 
shrinkage has to do with shrinkage for the zero j3k,o coefficients, and is im- 
mediately obvious from Figure 1 . Note how the are shrunken toward a 
cluster of values near zero for many of the zero coefficients, but are similar 
to the frequentist Z-tests for many of the nonzero coefficients. It is precisely 
this effect we refer to as selective shrinkage. 

In fact, this kind of selective shrinkage is not unique to the Bayesian 
variable selection framework. Shao (1993, 1996) and Zhang (1993) stud- 
ied cross-validation and bootstrapping for model selection and discovered 
that to achieve optimal asymptotic performance, a nonvanishing bias term 
was needed, and this could be constructed by modifying the resampling 
scheme (see the references for details). Overfit models (ones with too many 
parameters) are preferentially selected without this bias term. As a connec- 
tion to this current work, this amounts to detecting zero coefficients — which 
is a type of selective shrinkage. 

1.3. Organization of the article. The article is organized as follows. Sec- 
tion 2 presents an overview of spike and slab models. Section 3 introduces our 
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rescaled models and discusses the universality of priors, the role of rescaling 
and generalized ridge regression. Section 4 examines the optimality of the 
posterior mean under a local asymptotics framework. Section 5 introduces 
the Zcut selection strategy. Its optimality in terms of risk performance and 
complexity recovery is discussed. Section 6 uses a special paradigm in which 
j3 is assumed ordered a priori, and derives both forward and backward se- 
lection strategies in the spirit of Leeb and Potscher (2003). These are used 
to study the effects of model uncertainty. Sections 7 and 8 present a real 
data analysis and simulation. 

2. Spike and slab models. By a spike and slab model we mean a Bayesian 
model specified by the following prior hierarchy: 

(F^Aay^Ntx^cx 2 ), i = l,...,n, 
(2) (/3| 7 )~N(0,r), 

7 ~ 7r(d7), 
a 2 ~ fi(da 2 ), 

where is the .fT-dimensional zero vector, T is the K x K diagonal matrix 
diag(7i, . . . ,7k), ^ is the prior measure for 7 = (71, . . . ,7^)* and \x is the 
prior measure for a 2 . Throughout we assume that both 7r and \i are chosen 
to exclude values of zero with probability one; that is, ^{^jk > 0} = 1 for 
k = l,...,K and /x{cr 2 > 0} = 1. 

Lempers (1971) and Mitchell and Beauchamp (1988) were among the ear- 
liest to pioneer the spike and slab method. The expression "spike and slab" 
referred to the prior for (3 used in their hierarchical formulation. This was 
chosen so that (3k were mutually independent with a two-point mixture dis- 
tribution made up of a uniform flat distribution (the slab) and a degenerate 
distribution at zero (the spike). Our definition (2) deviates significantly from 
this. In place of a two-point mixture distribution, we assume that (3 has a 
multivariate normal scale mixture distribution specified through the prior ir 
for the hypervariance 7. Our basic idea, however, is similar in spirit to the 
Lempers-Mitchell-Beauchamp approach. To select variables, the idea is to 
zero out (3k coefficients that are truly zero by making their posterior mean 
values small. The spike and slab hierarchy (2) accomplishes this through the 
values for the hypervariances. Small hypervariances help to zero out coeffi- 
cients, while large values inflate coefficients. The latter coefficients are the 
ones we would like to select in the final model. 

Example 1 {Two- component indifference priors). A popular version of 
the spike and slab model, introduced by George and McCulloch (1993), 
identifies zero and nonzero (3k 's by using zero-one latent variables J?k ■ This 
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identification is a consequence of the prior used for (5 k , which is assumed to 
be a scale mixture of two normal distributions: 

(PkW ~ d (1 - ^fc)N(0,r|) + ^ fc N(0,c fc r|), k = l,...,K. 

[We use the notation N(0,u 2 ) informally here to represent the measure of a 
normal variable with mean and variance v 2 .] The value for r| > is some 
suitably small value while c k > is some suitably large value. Coefficients 
that are promising have posterior latent variables = 1. These coefficients 
will have large posterior hypervariances and, consequently, large posterior 
(3 k values. The opposite occurs when ^f k = 0. The prior hierarchy for (3 
is completed by assuming a prior for J? k . In principle, one can use any 
prior over the 2 K possible values for . . . however, often J? k are 

taken as independent Bernoulli(w;fc) random variables, where < w k < 1. It 
is common practice to set w k = 1/2. This is referred to as an indifference, 
or uniform prior. It is clear this setup can be recast as a spike and slab 
model (2). That is, the prior 7r(d'y) in (2) is defined by the conditional 
distributions 

(3) (lk\c k ,rl J? k ) W (1 - ^ fc )^(-) + ^5^(0, k = 1, . . . ,K, 
(^k\wk) ~ d (1 - w k )5 (-) + w k 5i(-), 

where 6 V (-) is used to denote a discrete measure concentrated at the value 
v. Of course, (3) can be written more compactly as 

(lk\ck,T 2 ,w k ) '~ (1 - w k )5 T 2(-) +w k 5 ? (-), k = 1, . . . ,K. 

k K k 

However, (3) is often preferred for computational purposes. 
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Fig. 2. Conditional density for "/k , where vo = 0.005, ai = 5 and ai = 50 and (a) w = 0.5, 
(b) w = 0.95. Observe that only the height of the density changes as w is varied. [Note as 
w has a uniform prior, (a) also corresponds to the marginal density for 7^ .] 



<s 



H. ISHWARAN AND J. S. RAO 



Example 2 (Continuous bimodal priors). In practice, it can be difficult 
to select the values for r|, c^rf and Wk used in the priors for (3 and J?k- 
Improperly chosen values lead to models that concentrate on either too 
few or too many coefficients. Recognizing this problem, Ishwaran and Rao 
(2000) proposed a continuous bimodal distribution for jk in place of the 
two-point mixture distribution for 7^ in (3). They introduced the following 
prior hierarchy for (3: 

{(3k\^r 2 k) ~ d N(0,^r|), k=l,...,K, 
(jF k \vQ,w) (l-w)S Vo (-) + wS 1 (-), 

(4) 

(r k 2 \ai,a 2 ) '~ ' Gamma(ai, a 2 ), 

w ~ Uniform[0, 1]. 

The prior tt for 7 is induced by 7^ = ^t|, and thus integrating over w 
shows that (4) is a prior for j3 as in (2). 

In (4), vq (a small near zero value) and a\ and a 2 (the shape and scale 
parameters for a gamma density) are chosen so that 7^ = ^fkTk nas a contin- 
uous bimodal distribution with a spike at Vq and a right continuous tail (see 
Figure 2). The spike at vq is important because it enables the posterior to 
shrink values for the zero /3k coefficients, while the right-continuous tail is 
used to identify nonzero parameters. Continuity is crucial because it avoids 
having to manually set a bimodal prior as in (3). Another unique feature 
of (4) is the parameter w. Its value controls how likely J^k equals 1 or vq, 
and, therefore, it takes on the role of a complexity parameter controlling the 
size of models. Notice that using an indifference prior is equivalent to choos- 
ing a degenerate prior for w at the value of 1/2. Using a continuous prior 
for w, therefore, allows for a greater amount of adaptiveness in estimating 
model size. 

3. Rescaling, penalization and universality of priors. The flexibility of 
a prior like (4) greatly simplifies the problems of calibration. However, just 
like any other prior, its effect on the posterior vanishes as the sample size in- 
creases, and without some basic adjustment to the underlying spike and slab 
model, the only way to avoid a washed out effect would be to tune the prior 
as a function of the sample size. Having to adjust the prior is undesirable. 
Instead, to achieve a type of "universality," or sample size invariance, we in- 
troduce a modified rescaled spike and slab model (Section 3.1). This involves 
replacing the original Y$ values with ones transformed by a \fn factor. Also 
included in the models is a variance inflation factor needed to adjust to the 
new variance of the transformed data. To determine an appropriate choice 
for the inflation factor, we show that this value can also be interpreted as a 
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penalization shrinkage effect of the posterior mean. We show that a value of 
n is the most appropriate because it ensures that the prior has a nonvanish- 
ing effect. This is important, because as we demonstrate later in Section 5, 
this nonvanishing effect, in combination with an appropriately selected prior 
for 7, such as (4), yields a model selection procedure based on the posterior 
mean with superior performance over one using the OLS. 

For our results we require some fairly mild constraints on the behavior of 
covariates. 

Design assumptions. Let X be the n x K design matrix from the re- 
gression model (1). We shall make use of one, or several, of the following 
conditions: 

(Dl) J2i=i Xi,k = and £?=i x\ k = n for each k = l,...,K. 
(D2) maxi<j< n Hx^/y^n — > 0, where || • || is the ^-norm. 
(D3) X*X is positive definite. 

(D4) S n = X*X/n — > So, where So is positive definite. 

Condition (Dl) simply reiterates the assumption that covariates are cen- 
tered and rescaled. Condition (D2) is designed to keep any covariate Xj from 
becoming too large. Condition (D3) will simplify some arguments, but is un- 
necessary for asymptotic results in light of condition (D4). Condition (D3) 
is convenient, because it frees us from addressing noninvertibility for small 
values of n. It also allows us to write out closed form expressions for the 
OLS estimate without having to worry about generalized inverses. Note, 
however, that from a practical point of view, noninvertibility for S n is not 
problematic. This is because the conditional posterior mean is a generalized 
ridge estimator, which always exists if the ridge parameters are nonzero. 

Remark 1. We call /3 R a generalized ridge estimator for f3 if /3 R = 
(X*X + D) _1 X*Y, where D is a K x K diagonal matrix. Here Y = (Y h . . . , Y n ) 1 
is the vector of responses. The diagonal elements d±, . . . , (Ik of D are assumed 
to be nonnegative and are referred to as the ridge parameters, while D is 
referred to as the ridge matrix. If dk > for each k, then X*X + D is always 
of full rank. See Hoerl (1962) and Hoerl and Kennard (1970) for background 
on ridge regression. 

3.1. Rescaled spike and slab models. By a rescaled spike and slab model, 
we mean a spike and slab model modified as follows: 



(5) 



(Y/|x,,/3,ay^N(x^,a 2 A n ) 
((3\j) ~ N(0,T), 

7 ~ Tr(d'y), 



i = 1, . . . , n 
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where Y* = a^n^^Yi are rescaled Yj values, a 2 = \\Y — X/3 n || 2 /(n — K) is 
the unbiased estimator for cr 2 based on the full model and (3 n = (X < X) _1 X*Y 
is the OLS estimate for f3 from (1). 

The parameter A n appearing in (5) is one of the key differences between (5) 
and our earlier spike and slab model (2). One way to think about this value is 
that it's a variance inflation factor introduced to compensate for the scaling 
of the Yi's. Given that a y'n-scaling is used, the most natural choice for 
A n would be n, reflecting the correct increase in the variance of the data. 
However, another way to motivate this choice is through a penalization 
argument. We show that A n controls the amount of shrinkage and that a 
value of A n = n is the amount of penalization required in order to ensure a 
shrinkage effect in the limit. 

Remark 2. When A n = n, we have found that a 2 in (5) plays an im- 
portant adaptive role in adjusting the penalty A n , but only by some small 
amount. Our experience has shown that under this setting the posterior for 
a 2 will concentrate around the value of one, thus fine tuning the amount of 
penalization. Some empirical evidence of this will be provided later on in 
Section 8. 

Remark 3. Throughout the paper when illustrating the spike and slab 
methodology, we use the continuous bimodal priors (4) in tandem with the 
rescaled spike and slab model (5) under a penalization A n = n. Specifically, 
we use the model 

(Y*\x. h (3,a 2 ) ~ d N(x*/3,a 2 n), i = l,...,n, 

(A^,T fe 2 ) ~ N(0, J? k r 2 ), k = l,...,K, 

(y k \vo,w) (1 - w)5 V0 (-) + w5 1 (-), 

(6) 

(r fc 2 |^i, £"2) '~' Gamma(oi,a 2 ), 

w ~ Uniform[0, 1], 

a~ 2 ~ Gamma(6i, 62), 

with hyperparameters specified as in Figure 2 and b± = 62 = 0.0001. Later 
theory will show the benefits of using a model like this. In estimating pa- 
rameters we use the Gibbs sampling algorithm discussed in Ishwaran and 
Rao (2000). We refer to this method as Stochastic Variable Selection, or SVS 
for short. The SVS algorithm is easily implemented. Because of conjugacy, 
each of the steps in the Gibbs sampler can be simulated from well-known 
distributions (see the Appendix for details). In particular, the draw for a 2 is 
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from an inverse-gamma distribution, and, in fact, the choice of an inverse- 
gamma prior for a 2 is chosen primarily to exploit this conjugacy. Certainly, 
however, other priors for a 2 could be used. In light of our previous comment, 
any continuous prior with bounded support should work well as long as the 
support covers a range of values that includes one. This is important because 
some of the later theorems (e.g., Theorem 2 of Section 3.3 and Theorem 7 
of Section 5.5) require a bounded support for a 2 . Such assumptions are not 
unrealistic. 

3.2. Penalization and generalized ridge regression. To recast A n as a 
penalty term, we establish a connection between the posterior mean and 
generalized ridge regression estimation. This also shows the posterior mean 
can be viewed as a model averaged shrinkage estimator, providing motiva- 
tion for its use [see also George (1986) and Clyde, Parmigiani and Vidakovic 
(1998) for more background and motivation for shrinkage estimators]. Let 
/3 n (7,<7 2 ) = E(/3|7, a 2 , Y*) be the conditional posterior mean for (3 from (5). 
It is easy to verify 

Kh, v 2 ) = (x 4 x + <r 2 A n r~ 1 )~ 1 x*Y* 

= a" V/*(X*X + a 2 X n T-^-^'Y, 

where Y* = (Yj*, . . . , Y*) 1 . Thus, (3* n {j, cr 2 ) is the ridge solution to a regres- 
sion of Y* on X with ridge matrix <7 2 A n r _1 . Let n = E(/3|Y*) denote the 
posterior mean for j3 from (5). Then 

3* =a^ 1 n 1 / 2 y , {(X < X + f j 2 A n r~ 1 )~ 1 X*Y}(vr x fi)(dj,da 2 \Y*). 

— * 

Hence, (3 n is a weighted average of ridge shrunken estimates, where the 
adaptive weights are determined from the posteriors of 7 and a 2 . In other 
words, j3 n is an estimator resulting from shrinkage in combination with model 
averaging. 

Now we formalize the idea of A n as a penalty term. Define O n (-y,a 2 ) = 
(7 n/ 9 n (7 , c 2 ) I \fn. It is clear 6 n {~y,a 2 ) is the ridge solution to a regression 
of Y on X with ridge matrix <T 2 A n r~ 1 . A ridge solution can always be recast 
as an optimization problem, which is a direct way of seeing how A n plays a 
penalization role. It is straightforward to show that 

(7) n *( 7 , a 2 ) = argmin j ||Y - X/3|| 2 + A n £ a 2 ^ \3g j, 

which shows clearly that A n is a penalty term. 
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Remark 4. Keep in mind that to achieve this same kind of penalization 
effect in the standard spike and slab model, (2), requires choosing a prior 
that depends upon n. To see this, note that the conditional posterior mean 
3 n (7,<r 2 ) =E(/3|7,<r 2 ,Y) from (2) is of the form 

3 n ( 7 , v 2 ) = (x'x + ^r-^-^x'Y. 

Multiplying P n (j,a 2 ) by y/n/a n gives 3*(7,cr 2 ), but only if a 2 is 0(A n ), 
or if r has been scaled by 1/ A n . Either scenario occurs only when the prior 
depends upon n. 

3.3. How much penalization? The identity (7) has an immediate conse- 
quence for the choice of A n , at least from the point of view of estimation. 
This can be seen by Theorem 1 of Knight and Fu (2000), which establishes 
consistency for Bridge estimators (ridge estimation being a special case). 
Their result can be stated in terms of hypervariance vectors with coordi- 
nates satisfying 7i = • • • = 7& = 7o, where < 70 < 00. For ease of notation, 
we write 7 = 70 1, where 1 is the i^-dimensional vector with each coordinate 
equal to one. Theorem 1 of Knight and Fu (2000) implies the following: 

Theorem 1 [Knight and Fu (2000)]. Suppose that £j are i.i.d. such that 
E(ej) = and E(e 2 ) = o~q. If condition (D4) holds and X n /n — ► Ao > 0, then 

{ K 

0* (70 1, a 2 ) A argmin ( (5 - /3 )*E (/3 - A,) + Ao^" 1 ]T (3 2 k 

p I k=i 
In particular, if Xq = 0, then 9 n (j l, a 2 ) (3 . 

Knight and Fu's result shows there is a delicate balance between the 
rate at which A n increases and consistency for (3 . Any sequence A n which 
increases at a rate of 0(n) or faster will yield an inconsistent estimator, 
while any sequence increasing more slowly than n will lead to a consistent 
procedure. 

The following is an analogue of Knight and Fu's result applied to rescaled 
spike and slab models. Observe that this result does not require £j to be 
identically distributed. The boundedness assumptions for ir and [i stated 
in the theorem are for technical reasons. In particular, the assumption that 
a 2 remains bounded cannot be removed. It is required for the penalization 
effect to be completely determined through the value for A n , analogous to 
Theorem 1 (however, recall from Remark 3 that this kind of assumption is 
not unrealistic). 

Theorem 2. Assume that (1) holds where £j are independent such that 
E(Ej) = and E(e 2 ) = <rg. Let 9 n = a n l3Jy/n. A ssume that conditions (D3) 
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and (D4) hold. Also, suppose there exists some rjo > such that ir{^k > 
?7o} = 1 for each k = 1, . . . , K and that ii{a 2 < Sq} = 1 for some < Sq < oo. 

•*--Sle -—-0 p 

If A„/n ->■ 0, i/ien n = /3 n + O p (X n /n) ->■ /3 . 

4. Optimality of the posterior mean. Theorem 2 shows that a penaliza- 
tion effect satisfying \ n /n — > yields a posterior mean (after rescaling) that 
is asymptotically consistent for /3 . While consistency is certainly crucial 
for estimation purposes, it could be quite advantageous in terms of model 
selection if we have a shrinkage effect that does not vanish asymptotically 
and a posterior mean that behaves differently from the OLS. This naturally 
suggests penalizations of the form A n = n. 

The following theorem (Theorem 3) is a first step in quantifying these 
ideas. Not only does it indicate more precisely the asymptotic behavior of 
the posterior for (3, but it also identifies the role that the normal hierarchy 
plays in shrinkage. An important conclusion is that the optimal way to 
process the posterior in a local asymptotics framework is by the posterior 
mean. We then begin a systematic study of the posterior mean (Section 5) 
and show how this can be used for effective model selection. 

For this result we assume \ n = n. Note that because of the rescaling 
of the Yj's, the posterior is calibrated to a -^/re-scale, and thus some type of 
reparameterization is needed if we want to consider the asymptotic behavior 
of the posterior mean. We will look at the case when the true parameter 
shrinks to at a -^/n-rate. Think of this as a "local asymptotics case." In 
some aspects these results complement the work in Section 3 of Knight and 
Fu (2000). See also Le Cam and Yang [(1990), Chapter 5] for more on local 
asymptotic arguments. 

We assume that the true model is 

(8) Y ni = x*/3 n + e ni , i = l,..., n, 

where for each n, e n \, . . . ,e nn are independent random variables. The true 
parameter is j3 n = j3 /^/n. Let Y* { = ^JnY n i. To model (8) we use a rescaled 
spike and slab model of the form 

{Y*i |xi, (3) ~ N(x*/3, n), i = l,...,n, 

(9) (/3|7)~f(<*/3|7), 

7 ~ TT^dj), 

where v(d(3\~f) is the prior for (3 given 7. Write v for the prior measure for (3, 
that is, the prior for j3 marginalized over 7. Let f n (-| Y*) denote the posterior 
measure for (3 given Y* = {Y* 1: ■ ■ ■ ,Y* n ) 1 . For simplicity, and without loss 
of generality, the following theorem is based on the assumption that Uq is 
known. There is no loss in generality in making such an assumption, because 
if 0q were unknown, we could always rescale Y n i by y/nY n ija n and replace 
(3 n with 00(3^/^/71 as long as a 2 — > Oq. Therefore, for convenience we assume 
erg = 1 is known. 
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Theorem 3. Assume that v has a density f that is continuous and 
positive everywhere. Assume that (8) is the true regression model, where e n i 
are independent such that E(e n j) = 0, E(e^) = o~q = 1 and E(e^) < M for 
some M < oo. If (D1)-(D4) hold, then for each (3 l G R K and each C > 0, 

(10) \MS(Po ,C/JE)\Y* n )J 

- log(^j) /3o)'=o(/3i - A)) + (01 - 0o)*Z, 

where Z /ias a N(0, So) distribution. Here S((3,C) denotes a sphere centered 
at (3 with radius C > 0. 

Theorem 3 quantifies the asymptotic behavior of the posterior and its 
sensitivity to the choice of prior for (3. Observe that the log-ratio posterior 
probability on the left-hand side of (10) can be thought of as a random 
function of /3j. Call this function ^ / n (/3 1 ). Also, the expression on the right- 
hand side of (10), 

(11) " |(0i - A))*Eo(/3i - A)) + (0i " Po)% 

is a random concave function of fli with a unique maximum at (3 + Sq 1 Z, 
a N(/3 ,Sq~ 1 ) random vector. Consider the limit under an improper prior 
for f3, where /(0o) = f(Pi) l° r each f3 1 . Then \I' n (/3 1 ) converges in distri- 
bution to (11), which as we said has a unique maximum at a N(/3 ,Sq 1 ) 
vector. This is the same limiting distribution for y/n(3 nl the rescaled OLS, 
under the settings of the theorem. Therefore, under a flat prior the posterior 
behaves similarly to the distribution for the OLS. This is intuitive, because 
with a noninformative prior there is no ridge parameter, and, therefore, no 
penalization effect. 

On the other hand, consider what happens when (3 has a N(0,To) prior. 
Now the distributional limit of ^ n (/3 1 ) is 

I/3oT % - i/3*ro X 0! - i(0! - o )^o(0i - 0o) + (01 - 0o)*Z- 

As a function of this is once again concave. However, now the maximum 
is 

^(Eo + ro^CEoPo + Z), 

which is a N(V V X E l ) random vector, where V = I + Eg 1 r o 1 . 
Let QOIto) represent this limiting normal distribution. 

The distribution Q(-|7o) is quite curious. It appears to be a new type 
of asymptotic ridge limit. The next theorem identifies it as the limiting 
distribution for the posterior mean. 
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Theorem 4. Assume that (3 has a N(0,To) prior for some fixed Tq. 
Let /3 nn (7o) = IE( /3|^ , Y*) be the posterior mean from (9), where (8) is the 
irwe model. Under the same conditions as Theorem 3, we have /3 nn (7o) 
QOlTo)- 

Theorem 4 shows the importance of the posterior mean when coefficients 
shrink to zero. In combination with Theorem 3, it shows that in such settings 
the correct estimator for asymptotically maximizing the posterior must be 
the posterior mean if a normal prior with a fixed hypervariance is used. 
Notice that the data does not have to be normal for this result to hold. 

5. The Zcut method, orthogonality and model selection performance. 

Theorem 4 motivates the use of the posterior mean in settings where co- 
efficients may all be zero and when the hypervariance is fixed, but how does 
it perform in general, and what are the implications for variable selection? 
It turns out that under an appropriately specified prior for 7, the poste- 
rior mean from a rescaled spike and slab model exhibits a type of selective 
shrinkage property, shrinking in estimates for zero coefficients, while retain- 
ing large estimated values for nonzero coefficients. This is a key property of 
immense potential. By using a hard shrinkage rule, that is, a threshold rule 
for setting coefficients to zero, we can take advantage of selective shrinkage 
to define an effective method for selecting variables. We analyze the theoret- 
ical performance of such a hard shrinkage model estimator termed "Zcut." 
Our analysis will be confined to orthogonal designs (i.e., S n = So = I) for 
rescaled spike and slab models under a penalization of X n = n. Under these 
settings we show Zcut possesses an oracle like risk misclassification property 
when compared to the OLS. Specifically, we show there is an oracle hyper- 
variance 7 which leads to uniformly better risk performance (Section 5.3) 
and that this type of risk performance is achieved by using a continuous 
bimodal prior as specified by (4). 

5.1. Hard shrinkage rules and limiting null distributions. The Zcut pro- 
cedure (see Section 5.2 for a formal definition) uses a hard shrinkage rule 
based on a standard normal distribution. Coefficients are set to zero by 
comparing their posterior mean values to an appropriate cutoff from a 
standard normal. This rule can be motivated using Theorem 4. This will 
also indicate an alternative thresholding rule that is an adaptive function 
of the true coefficients. For simplicity, assume that fJ,{o~ = 1} = 1. Under 
the assumptions outlined above, Theorem 4 implies that /3 n (7), the condi- 
tional posterior mean from (5), is approximately distributed as Qni'l'j), a 
N(- v /nD/3 /o'o, D*D) distribution, where D is the diagonal matrix diag(7i/(l + 
71), ... ,7k-/(1 + tr-)) (to apply the theorem in the nonlocal asymptotics 



16 



H. ISHWARAN AND J. S. RAO 



case, simply replace /3 with y/n/3 /ao). Consequently, the (unconditional) 
posterior mean (3 n should be approximately distributed as 

Qn(') = /Qn("|7Md7|Y*). 

This would seem to suggest that in testing whether a specific coefficient f3k,o 
is zero, and, therefore, deciding whether its coefficient estimate should be 
shrunk to zero, we should compare its posterior mean value to the kth. 
marginal of Q* under the null f3k,o = 0. Given the complexity of the posterior 
distribution for 7, it is tricky to work out what this distribution is exactly. 
However, in its place we could use 

QMuii(-) = / N (°> (l^) 2 )^l Y *)- 

Notice that this is only an approximation to the true null distribution be- 
cause 7r(d7fc|Y*) does not specifically take into account the null hypothesis 
Pk,o — 0- Nevertheless, we argue that Q* k null is a reasonable choice. We will 
also show that a threshold rule based on Q k u is not that different from 
the Zcut rule which uses a N(0, 1) reference distribution. 

Both rules can be motivated by analyzing how 7r(d7fc|Y*) depends upon 
the true value for the coefficient. First consider what happens when fyo 7^ 




Covariates 



Fig. 3. Adaptive null intervals. Boxplots of simulated values from Ql null , with k sorted 
according to largest absolute posterior mean (data from Figure 1). Values from Q* k null were 
drawn within the SVS Gibbs sampler from a multivariate N(0, (j 2 V^S„V n ) distribution 
where V n = (En + T ) . Whiskers identify 90% null intervals. Superimposed are Zk, n 
frequentist test statistics (green squares) and estimated values for PI n (blue circles and 
red triangles used for zero and nonzero coefficients, resp.). 
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and the null is misspecified. Then the posterior will asymptotically concen- 
trate on large 7^ values and 7fc/(l + 7fc) should be concentrated near one (see 
Theorem 6 later in this section). Therefore, Q k null will be approximately 
N(0, 1). Also, when /3fc,o 7^ 0, the kth marginal distribution for Q n ( - |7) is 
dominated by the mean, which in this case equals <7^ 1 - v /n/3fc i o7fc/(l + 7fc)- 
Therefore, if 7^. is large, fi* k n is of order 

v^y/nPkfl + O p (l) = a^y/nPkflO. + O p (l/y/n)), 

which shows that the null is likely to be rejected if n is compared to a 
N(0, 1) distribution. 

On the other hand, consider when j3k,o = and the null is really true. Now 
the hypervariance 7^ will often take on small to intermediate values with 
high posterior probability and 7r(d7fc|Y*) should be a good approximation 
to the posterior under the null. In such settings, using a N(0, 1) in place of 
Qk null wm be slightly more conservative, but this is what we want (after 
all the null is really true). Let z a / 2 be the 100 x (1 — a/2) percentile of a 
standard normal distribution. Observe that 

a = P{|N(0,l)| >z a/2 } 
> |p{|N(0,l)|> V2 (j^) 'jvr^lY*) 

= Qk,null{\Pk,n\ — z a/2\- 

Therefore, a cut-off value using a N(0, 1) distribution yields a significance 
level larger than null- This is because QX nVL \\ nas a smaller variance 
^((7fc/(l + 7fc)) 2 |^*) an d, therefore, a tighter distribution. 

Figure 3 compares the two procedures using the data from our earlier sim- 
ulation (recall this uses a near orthogonal X design). Depicted are boxplots 
for values simulated from Q£ nu u for each k (see the caption for details). 
The dashed horizontal lines at ±1.645 represent a a = 0.10 cutoff using 
a N(0, 1) null, while the whiskers for each boxplot are 90% null intervals 
under Q^ nu u- In this example both procedures lead to similar estimated 
models, and both yield few false discoveries. In general, however, we prefer 
the N(0, 1) approach because of its simplicity and conservativeness. Never- 
theless, the Q* k null intervals can always be produced as part of the analysis. 
These intervals are valuable because they depict the variability in the pos- 
terior mean under the null but are also adaptive to the true value of the 
coefficient via 7r(d7&|Y*). 

5.2. The Zcut rule. The preceding argument suggests the use of a thresh- 
olding rule that treats the posterior mean as a N(0, 1) test statistic. This 
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method, and the resulting hard shrinkage model estimator, have been re- 
ferred to as Zcut [Ishwaran and Rao (2000, 2003, 2005)]. Here is its formal 
definition. 



Here a > is some fixed value specified by the user. The Zcut estimator for 
(3 is the restricted OLS estimator applied to only those coefficients in the 
Zcut model (all other coefficients are set to zero). 

Zcut hard shrinks the posterior mean. Hard shrinkage is important be- 
cause it reduces the dimension of the model estimator, which is a key to 
successful subset selection. Given that the posterior mean is already taking 
advantage of shrinkage, it is natural to wonder how this translates into per- 
formance gains over conventional hard shrinkage procedures. We compare 
Zcut theoretically to "OLS-hard," the hard shrinkage estimator formed from 
the OLS estimator (3 n = (Pi n , ■ ■ ■ ,Pk „)*• Here is its definition: 

The OLS-hard model estimator. The OLS-hard model corresponds 
to the model with coefficients (3k whose Z-statistics, Zk, n , satisfy \Zk >n \ > 
z a / 2 , where 



The OLS-hard estimator for f3 is the restricted OLS estimator using only 
OLS-hard coefficients. 

5.3. Oracle risk performance. If Zcut is going to outperform OLS-hard 
in general, then it is reasonable to expect it will be better in the fixed 
hypervariance case for some appropriately selected 7. Theorem 5, our next 
result, shows this to be true in the context of risk performance. We show 
there exists a value ■j = To that leads not only to better risk performance, 
but uniformly better risk performance. Let £3$ = {k : f3k,o = 0} be the indices 
for the zero coefficients of (3 . Define 




9 ™ 1/2 k 

ZjU „ — ~ ; TT 




and Skk is the kth diagonal value from £! n . That is, 



OLS-hard := {(3 k ■ \Zk,n\ > z a / 2 }- 



& z (a)= £ PM jn |>z Q/2 }+ £ PM, n |<^ /2 }. 
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This is the expected number of coefficients misclassified by Zcut for a fixed 
a-level. This can be thought of as the risk under a zero-one loss function. 
The misclassification rate for Zcut is Mz{cn) / K . Similarly, define 



to be the risk for OLS-hard. 

Theorem 5. Assume that the linear regression model (1) holds such 



N(0,ro) prior, fi{a 2 = 1} = 1 and X n = n. Then for each < 5 < 1/2 there 
exists a -y such that Mz{a) < Mo{a) for all a £ [5,1 — 5]. 

Theorem 5 shows that Zcut's risk is uniformly better than the OLS-hard 
in any finite sample setting if 7 is set at the oracle value 7 - Of course, in 
practice, this oracle value is unknown, which raises the interesting question 
of whether the same risk behavior can be achieved by relying on a well-chosen 
prior for 7. Also, Theorem 5 requires that £j are normally distributed, but 
can this assumption be removed? 

5.4. Risk performance for continuous bimodal priors. Another way to 
frame these questions is in terms of the posterior behavior of the hypervari- 
ances 7^. This is because risk performance ultimately boils down to their 
behavior. One can see this by carefully inspecting the proof of Theorem 5. 
There the oracle 7 is chosen so that its values are large for the nonzero fa o 
coefficients and small otherwise. Under any prior ir, 



In particular, for the it obtained by fixing 7 at 7 , the posterior mean 
is shrunk toward zero for the zero coefficients, thus greatly reducing the 
number of misclassifications from this group of variables relative to OLS- 
hard. Meanwhile for the nonzero coefficients, f3t n is approximately equal to 

Zk jn , so the risk from this group of variables is the same for both procedures, 
and, therefore, Zcut's risk is smaller overall. Notice that choosing 7 in 
this fashion also leads to what we have been calling selective shrinkage. So 
good risk performance follows from selective shrinkage, which ultimately is 
a statement about the posterior behavior of 7^. This motivates the following 
theorem. 

Theorem 6. Assume in (1) that condition (D2) holds and £j are inde- 
pendent such that E(£j) = ; E(e?) = Uq and E(ef ) < M for some M < 00. 
Suppose in (5) that ^{o 1 = 1} = 1 and X n = n. 



& (a)= ]T F{\Zk,n\>Z a/2 }+ £ n\Z k ,n\ < Z a/2 } 




that ko < K and where 
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(a) If the support for tt contains a set [t]q,oo) k for some finite constant 
Vo > 0, then, for each small 5 > 0, 



Ik 



[X 1 ■ T+^ k > 1 " 5 ) Y * ) 4 1 lf 0k '° + °' 

where 7r n (-|Y*) is i/ie posterior measure for 7. 

(b) Let denote the posterior density for 7^ given w. If it is the 

continuous bimodal prior specified by (4), then 

(13) fk(u\w) oc ex p( 2 (l + M ) ^ w ) ^ + M )~ 1/2 (( 1 ~ «0st>( u ) + u>fll(«))> 
w/iere O («) = «o« _2 #( v o« _1 )> ffiH =^~ 2 5'(^~ 1 ). 

5 ( n ) = ( Ql - i)T " ai ~ ex P(~ Q 2M), 

and 6,n = 5 n " 1 n^ 1 / 2 Er=i3 ; i,fc^- Note that if 0k,o = O, then £ k ,n & N(0, 1). 

Part (a) of Theorem 6 shows why continuity for 7r is needed for good 
risk performance. To be able to selectively shrink coefficients, the posterior 
must concentrate on arbitrarily large values for the hypervariance when 
the coefficient is truly nonzero. Part (a) shows this holds asymptotically 
as long as tt has an appropriate support. A continuous prior meets this 
requirement. Selective shrinkage also requires small hypervariances for the 
zero coefficients, which is what part (b) asserts happens with a continuous 
bimodal prior. Note importantly that this is a finite sample result and is 
distribution free. The expression (13) shows that the posterior density for 
7^ (conditional on w) is bimodal. Indeed, except for the leading term 

(14) ^(2(1^0^ 

which reflects the effect on the prior due to the data, the posterior density 
is nearly identical to the prior. What (14) does is to adjust the amount of 
probability at the slab in the prior (cf. Figure 2) using the value of £% n - 
As indicated in part (b), if the coefficient is truly zero, then £| n will have 
an approximate ^-distribution, so this should introduce a relatively small 
adjustment. Notice this also implies that the effect of the prior does not 
vanish asymptotically for zero coefficients. This is a key aspect of using a 
rescaled spike and slab model. Morever, because the posterior for 7^ will be 
similar to the prior when /3k,o = 0, it will concentrate near zero, and hence 
the posterior mean will be biased and shrunken toward zero relative to the 
frequentist Z-test. 
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On the other hand, if the coefficient is nonzero, then (14) becomes expo- 
nentially large and most of the mass of the density shifts to larger hyper- 
variances. This, of course, matches up with part (a) of the theorem. Figure 4 
shows how the posterior cumulative distribution function varies in terms of 
£j? n . Even for fairly large values of £| n (e.g., from the 75th percentile of a 
X 2 -distribution), the distribution function converges to one rapidly for small 
hypervariances. This shows that the posterior will concentrate on small hy- 
pervariances unless £| n is abnormally large. 

Figure 5 shows how the hypervariances might vary in a real example. 
We have plotted the posterior means /3| for the Breiman simulation of 
Figure 1 against E((7fc/(1 + 7fc)) 2 |Y*) (the variance of Q% nu a)- This shows 
quite clearly the posterior's ability to adaptively estimate the hypervari- 
ances for selective shrinkage. Figure 6 shows how this selective shrinkage 
capability is translated into risk performance. As seen, Zcut's misclassifica- 
tion performance is uniformly better than OLS-hard over a wide range of 
cut-off values, exactly as our theory suggests. 

Remark 5. The assumption in Theorems 5 and 6 that /i{cr 2 = 1} = 1 is 
not typical in practice. As discussed, it is beneficial to assume that a 2 has 

Standardized Hypervariance 
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a continuous prior to allow adaptive penalization. Nevertheless, Theorem 5 
shows even if cr 2 = 1, thus forgoing the extra benefits of finite sample adapta- 
tion, the total risk for Zcut with an appropriately fixed -y Q is still uniformly 
better than OLS-hard. The same argument could be made for Theorem 6. 
That is, from a theoretical point of view, it is not restrictive to assume a 
fixed a 2 . 

5.5. Complexity recovery. We further motivate Zcut by showing that it 
consistently estimates the true model under a threshold value that is allowed 
to change with n. Let 

4 = ^,0 7^0},..., 1(^,0^0})* 

be the X-dimensional binary vector recording which coordinates of /3 are 
nonzero [!(•) denotes the indicator function]. By consistent estimation of the 
true model, we mean the existence of an estimator ^ n such that -h> . 
We show that such an estimator can be constructed from f3 n . Let 

M n {C) = (l{\M, n \ > C}, • • .,I{\P* K ,n\ > C}f. 

The Zcut estimator corresponds to setting C = z a /2 ■ The next theorem shows 
we can consistently recover by letting C converge to oo at any rate slower 
than sfn. 

Theorem 7. Assume that the priors tt and [i in (5) are chosen so that 
T^{lk > %} = 1 for some 7]q>0 for each k = 1, . . . , K and that fi{a 2 < Sq} = 
1 for some < Sq < oo. Let ^ n = ^ n (C n ), where C n — ► oo is any positive 
increasing sequence such that C n j ' \fn^ 0. Assume that the linear regression 
model (1) holds where E{ are independent such that E(ej) = 0, = a 2 and 

E(ef) < M for some M < oo. // (D2) holds and \ n = n, then Jt n 

An immediate consequence of Theorem 7 is that the true model complex- 
ity ko can be estimated consistently. By the continuous mapping theorem, 
we obtain the following: 

Corollary 1. Let J( n = (~-#L, n , • ■ • ,-^K,nf and let k n = J2k=iH^k,n / 
0} be the number of nonzero coordinates of ^ n . Then, under the conditions 
of Theorem 7, k n — > ko . 

6. The effects of model uncertainty. In this section we prove an asymp- 
totic complexity result for a specialized type of forward stepwise model 
selection procedure. This forward stepwise method is a modification of a 
backward stepwise procedure introduced by Potscher (1991) and discussed 
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recently in Leeb and Potscher (2003). We show in orthogonal settings that if 
the coordinates of /3 are perfectly ordered a priori, then the forward stepwise 
procedure leads to improved complexity recovery relative to the OLS-hard. 
Interestingly, the backward stepwise procedure has the worst performance of 
all three methods (Theorem 8 of Section 6.3). This result can be used as an 
empirical tool for assessing a procedure's ability to reduce model uncertainty. 
If a model selection procedure is effectively reducing model uncertainty, then 
it should produce an accurate ranking of coefficients in finite samples. Con- 
sequently, the forward stepwise procedure based on this data based ranking 
should perform better than OLS-hard. This provides an indirect way to 
confirm a procedure's ability to reduce model uncertainty. 

Remark 6. The idea of pre-ranking covariates and then selecting mod- 
els has become a well established technique in the literature. As mentioned, 
this idea was used by Potscher (1991) and Leeb and Potscher (2003), but 
also appears in Zhang (1992), Zheng and Lo (1995, 1997), Rao and Wu 
(1989) and Ishwaran (2004). 

We use this strategy to assess the performance of a rescaled spike and 
slab model. For a data based ordering of (3, we use the absolute posterior 
means The first coordinate of (3 corresponds to the largest abso- 

lute posterior value, the second coordinate to the second largest value, and 
so forth. The data based forward stepwise procedure using this ranking is 
termed "svsForwd." Section 6.2 provides a formal description. In Section 8 
we use simulations to systematically compare the performance of svsForwd 
to OLS-hard as an indirect way to confirm SVS's ability to reduce model 
uncertainty. Figure 7 provides some preliminary evidence of this capability. 
There we have compared a ranking of (3 using the posterior mean against 
an OLS ordering using |Zfc, n |- Figure 7 is based on the simulation presented 
in Figure 1. 

We note that it is possible to consistently estimate the order of the /3 
coordinates using the posterior mean. Let Uk, n be the /cth largest value from 
the set {\(3* Kn \ : k = 1, . . . , K}. That is, U 1>n >U 2 , n >---> U K>n . Let 

M {n) = (I{C/l,n > C n }, . . .,I{U K ,n > Cn})*, 

where C n is a positive sequence satisfying C n — > oo and C n j\fn^> 0. By 
inspection of the proof of Theorem 7, Corollary 2 can be shown. 

Corollary 2. Under the conditions of Theorem 7, ~#( n ) (1, . . . , 1, f K 
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Fig. 7. (a) True rank of a coefficient versus estimated rank using posterior means (cir- 
cles) and OLS (squares). The lower the rank, the larger the absolute value of the coefficient. 
Data from Breiman simulation of Figure 1 (only nonzero coefficients shown), (b) Same 
plot as (a) but with true ranks averaged to adjust for ties in true coefficient values (sim- 
ulation used four unique nonzero coefficient values). Dashed line connects values for true 
average rank. Note the higher variability in OLS, especially for intermediate coefficients. 
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6.1. Backward model selection. We begin by reviewing the backward 
stepwise procedure of Potscher (1991). For notational ease, we avoid sub- 
scripts of n as much as possible. We assume the coordinates of /3 have been 
ordered, so that the first k$ coordinates are nonzero. That is, 

Po = (Pi,o, ■ • ■> /#fco,o> ir-fc ) ' 

where Ox-k is the (K — /co) -dimensional zero vector. We assume the design 
matrix X has been suitably recoded as well. Let X[fc] be the n x k design 
matrix formed from the first k columns of the re-ordered X. Let 

= (x[k] t x[k]y 1 x.[k] t Y 

be the restricted OLS estimator using only the first k variables. To test 
whether the kth coefficient (5k is zero, define the test statistic 

(15) Z,„- " I/2 »l 



Vn{skk[k]) 1/2 ' 

where Sfcfc[fe] is the kth diagonal value from (X[A;]*X[fc]/n)~ 1 . Let a\, . . . ,ok 
be a sequence of fixed positive a-significance values for the n test statis- 
tics. Estimate the true complexity k^ by the estimator kg, where 

k B = max{A: : \Z k , n \ > z ak/2 ,k = 0, . . . , K}. 

To ensure that ks is well defined, take Z 0jn = and z ao / 2 = 0. 

Observe if ks = k, then Zk, n is the first test starting from k = K and 
going backward to k = satisfying \Z k)n \ > z ak / 2 and \Zj, n \ < z a ./ 2 for j = 
k + 1, . . . , K. This corresponds to accepting the event {/3 : Pk+i — 0, . . . , (3k = 
0}, but rejecting {/3 : j3 k = 0, . . . , (5k = 0}. The post-model selection estimator 
for (3 is defined as 

K 

3 B = K I{k B = 0} + J2(P°[k} t ,0 t K-k) t Hk B = k}. 
k=i 

It should be clear that the estimators k B and (3 B are derived from a 
backward stepwise mechanism. 

Remark 7. Observe that Z^^ n uses a 2 , the estimate for <Tq based on the 
full model, rather than an estimate based on the first k variables, and so, 
in this way, is different from a conventional stepwise procedure. The latter 
estimates are only unbiased if k > k$ and can perform quite badly otherwise. 
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Remark 8. At first glance it seems the backward procedure requires K 
regression analyses to compute (3 [k] for each k. This would be expensive 
for large K, requiring a computational effort of 0(J2k=i^ 3 )- I n fact, the 
whole procedure can be reduced to the problem of finding an orthogonal 
decomposition of the X matrix, an 0(K S ) operation. This idea rests on the 
following observations implicit in Lemma A.l of Leeb and Potscher (2003). 
Let 

Pi = i-x[k](x[k] t x[k]y 1 x[k] t 

be the projection onto the orthogonal complement of the linear space spanned 
by X[fc]. Let denote the kth column vector of X (thus X[fc] = [xm, . . . , x^)]). 
Define 

ui = x (1) and u k = P^_ 1 x (fc) for k = 2, . . . , K. 

One can show that 

P° k [k] = (u{u k y 1 u{Y, k = l,...,K. 

Consequently, the backward procedure is equivalent to finding an orthogonal 
decomposition of X. (Note that this argument shows . . . , f3° K {K] are 

mutually uncorrelated if £j are independent, E(ej) = and E(ej) = <7q. See 
Lemma A.l of Leeb and Potscher (2003). This will be important in the proof 
of Theorem 8.) 

6.2. Forward model selection. A forward stepwise procedure and its asso- 
ciated post-model selection estimator for f3 can be defined in an analogous 
way. Define 

(16) k F = mm{k- l:|Z fe ,„| < z ah / 2 , k = 1, . . . ,K + 1}, 

where Zj<+i,n = and ax+i = are chosen to ensure a well-defined proce- 
dure. Observe if k F = k — 1, then Z k ^ n is the first test statistic such that 
\Zk,n\ < z a k /2, while \Zj t n\ > z a ./ 2 for j = 1, . . . , k — 1. This corresponds 
to accepting the event {/3:/?i / 0,...,(3 k ~i / 0}, but rejecting {/3 : j3\ ^ 
0, . . . 7^ 0}. Note that kp = if \Z\^ n \ < z ai / 2 - The post-model selection 
estimator for (3 is 

K 

(17) p F = o K i{k F = o} + ]T(3>]', ^K^Mkp = k}. 

k=l 

The data based version of forward stepwise, svsForwd, mentioned earlier is 
defined as follows: 
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The svsForwd model estimator. Re-order the coordinates of (3 
(and the columns of the design matrix X) using the absolute posterior means 
PI n from (5). If kp > 1, the svsForwd model is defined as 

svsForwd := {j3 k :k=l,..., kp}; 

otherwise, if &f = 0, let svsForwd be the null model. Define the svsForwd 
post-model selection estimator for /3 as in (17). 

6.3. Complexity recovery. The following theorem identifies the limiting 
distribution for ks and kp- It also considers OLS-hard. Let ko denote the 
OLS-hard complexity estimator (i.e., ko equals the number of parameters 
in OLS-hard). Part (a) of the following theorem is related to Lemma 4 of 
Potscher (1991). 

Theorem 8. Assume that (D1)-(D4) hold for (1), where e, are inde- 
pendent such that E(ffj) = 0, E(e?) = <7q and E(e|) < M for some M < oo. 
Let ks, kp and ko denote the limits for ks, kp and ko, respectively, as 
n — > oo. For 1 < k < K , 

(a) F{k B = k} = x I{k < k } + (1 - a ko+1 ) • • • (1 - a K )I{k = k } 

+ a k (l - atk+i) •••(!- a K )l{k > k }. 
Moreover, when X has an orthogonal design (i.e., S n = So = 1), 

(b) P{k F = k} = x I{k < k } + (1 - a ko+1 )I{k = k } 

+ (1 - a fc+ i)a fco+ i • • • a k I{k > k }. 

(c) P{k = k} = x l{k < k } 

+ ¥{B ko+1 + ... + B K = k- k }l{k > k }, 

where ctx+i = in (b) and B k are independent Bernoulli(afc) random vari- 
ables for k = ko + 1, . . . , K. 

Remark 9. Although the result (b) requires an assumption of orthog- 
onality, this restriction can be removed. See equation (38) of Corollary 4.5 
from Leeb and Potscher (2003). 

Theorem 8 shows that forward stepwise is the best procedure in orthogo- 
nal designs. Suppose that a k = a > for each k. Then the limiting probabil- 
ity of correctly estimating ko is ¥{kp = ko} = (1 — a) for forward stepwise, 
while for OLS-hard and backward stepwise, it is (1 — a) K ~ k ° . Notice if K — ko 
is large, this last probability is approximated by exp(— (K — ko)a), which 
becomes exponentially small as K increases. Simply put, the OLS-hard and 
backward stepwise methods are prone to overfitting. Figure 8 illustrates how 
the limiting probabilities vary under various choices for K and ko (all figures 
computed with a = 0.10). One can clearly see the superiority of the forward 
procedure, especially as K increases. 
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Fig. 8. Complexity recovery in the orthogonal case. Limiting probabilities versus model 
dimension k for the three estimators kF (— ), fcs (— ) and ko (— ): (a) K = 25, ko = 10, 
(b) K = 50, k = 20, (c) K = 100, ko = 50. In all cases a k = 0.10. 



7. Diabetes data example. As an illustration of the different model selec- 
tion procedures we consider an example from Efron, Hastie, Johnstone and 
Tibshirani (2004). In illustrating the LARS method, Efron, Hastie, John- 
stone and Tibshirani analyzed a diabetes study involving n = 442 patients 
in which the response of interest, Yj, is a quantitative measure of disease 
progression recorded one year after baseline measurement. Data included 
ten baseline variables: age, sex, body mass index, average blood pressure 
and six blood serum measurements. All covariates were standardized and Yi 
was centered so that its mean value was zero. Two linear regression models 
were considered in the paper. The first was a main effects model involving 
the 10 baseline measurements, the second, a "quadratic model," which we 
re-analyze here, was made up of 64 covariates containing the 10 baseline 
measurements, 45 interactions for the 10 original covariates and 9 squared 
terms (these being the squares of each of the original covariates except for 
the gender variable, which is binary). 

Table 1 contains the results from our analysis of the quadratic model. 
Listed are the top 10 variables as ranked by their absolute posterior means, 
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|/3£ |. Using an a = 0.10 criteria, Zcut chooses a model with six variables 
starting from the top variable "bmi" (body mass index) and ending with 
"age. sex" (the age-sex interaction effect). The seventh variable, "bmi. map" 
(the interaction of body mass index and map, a blood pressure measure- 
ment), is borderline significant. Table 1 also reports results using OLS-hard, 
svsForwd and a new procedure, "OLSForwd" (all using an a = 0.10 value). 
OLSForwd is the direct analogue of svsForwd, but orders /3 using Z-statistics 
Zk )T i in place of the posterior mean. For all procedures the values in Ta- 
ble 1 are Z-statistics (12) derived from the restricted OLS for the selected 
model. This was done to allow direct comparison to the posterior mean 
values recorded in column 2. 

Table 1 shows that the OLS-hard model differs significantly from Zcut. It 
excludes both "ltg" and "hdl" (blood serum measurements), both of which 
have large posterior mean values. We are not confident in the OLS-hard 
and suspect it is missing true signal here. The same comment applies to 
OLSForwd, which has produced the same model as OLS-hard. Note how 
svsForwd, the counterpart for OLSForwd, agrees closely with Zcut (it dis- 
agrees only on bmi. map, which is borderline significant). We believe the SVS 
models are more accurate than the OLS ones. In the next section we more 
systematically study the differences between the four procedures. 

Remark 10. Figure 9 displays the posterior density for a 2 . Note how 
the posterior is concentrated near one. This is typical of what we see in 
practice. 



Table 1 

Top 10 variables from diabetes data (ranking based on absolute 
posterior means \/3^ n \). Entries for model selection procedures are 
Z -statistics (12) derived from the restricted OLS for the selected 

model 





Variable 




Zcut 


OLS-hard 


svsForwd 


OLSForwd 


1 


bmi 


9.54 


8.29 


13.70 


8.15 


13.70 


2 


ltg 


9.25 


7.68 


0.00 


7.82 


0.00 


3 


map 


5.64 


5.39 


7.06 


4.99 


7.06 


4 


hdl 


-4.37 


-4.20 


0.00 


-4.31 


0.00 


5 


sex 


-3.38 


-4.03 


-1.95 


-4.02 


-1.95 


6 


age. sex 


2.43 


3.58 


3.19 


3.47 


3.19 


7 


bmi. map 


1.61 


0.00 


2.56 


3.28 


2.56 


8 


glu.2 


0.84 


0.00 


0.00 


0.00 


0.00 


9 


bmi.2 


0.46 


0.00 


0.00 


0.00 


0.00 


10 


tc.tch 


-0.44 


0.00 


0.00 


0.00 


0.00 
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Fig. 9. Posterior density for a 2 from diabetes analysis. 

8. Breiman simulations. We used simulations to more systematically 
study performance. These followed the recipe given in Breiman (1992). 
Specifically, data were generated by taking £j to be i.i.d. N(0, 1) variables, 
while covariates Xj were simulated independently from a multivariate nor- 
mal distribution such that E(xj) = and M(xijXi t k) = p^~ k \ where < p < 1 
was a correlation parameter. We considered two settings for p: (i) an uncor- 
rected design, p = 0; (ii) a correlated design, p = 0.90. For each p setting 
we also considered two different sample size and model dimension configu- 
rations: (A) n = 200 and K = 100; (B) n = 800 and K = 400. Note that our 
illustrative example of Figure 1 corresponds to the Monte Carlo experiment 
(B) with p = 0. 

In the higher-dimensional simulations (B), the nonzero /3k o coefficients 
were in 15 clusters of 7 adjacent variables centered at every 25th variable. For 
example, for the variables clustered around the 25th variable, the coefficient 
values were given by /?25+j,o = \h — il 1 ' 25 for \j\ < h, where h = 4. The other 
14 clusters were defined similarly. All other coefficients were set to zero. This 
gave a total of 105 nonzero values and 295 zero values. Coefficient values 
were adjusted by multiplying by a common constant to make the theoretical 
R 2 value equal to 0.75 [see Breiman (1992) for a discussion of this point]. 

Simulations (B) reflect a regression framework with a large number of 
zero coefficients. In contrast, simulations (A) were designed to represent a 
regression model with many weakly informative covariates. For (A), nonzero 
flkfi coefficients were grouped into 9 clusters each of size 5 centered at every 
10th variable. Each of the 45 nonzero coefficients was set to the same value. 
Coefficient values were then adjusted by multiplying by a common constant 
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to make the theoretical R 2 value equal to 0.75. This ensured that the overall 
signal to noise ratio was the same as (B), but with each coefficient having 
less explanatory power. 

Simulations were repeated 100 times independently for each of the four 
experiments. Results are recorded in Table 2 for each of the procedures Zcut, 
svsForwd, OLS-hard and OLSForwd (all using an a = 0.10 value). Table 2 
records what we call "TotalMiss," "FDR" and "FNR." The TotalMiss is the 
total number of misclassified variables, that is, the total number of falsely 
identified nonzero (3^,0 coefficients and falsely identified zero coefficients. 
This is an unbiased estimator for the risk discussed in Theorem 5. The 
FDR and FNR are the false discovery and false nondiscovery rates defined 
as the false positive and false negative rates for those coefficients identified 
as nonzero and zero, respectively. The TotalMiss, FDR and FNR values 
reported are the averaged values from the 100 simulations. Also recorded 
is k, the average number of variables selected by a procedure. Table 2 also 
includes the performance value "Perf," a measure of prediction accuracy, 
defined as 

P erf=1 Jl x 3-XA,||* 



l|XA>|| 2 

where (3 is the estimator for /3 . So Perf equals zero when (3 = and equals 
one when (3 = (3 . The value for Perf was again averaged over the 100 sim- 
ulations. 



Table 2 
Breiman simulations 



p = (uncorrelated X) p = 0.9 (correlated X) 



k Perf TotalMiss FDR FNR k Perf TotalMiss FDR FNR 

(A) Moderate number of covariates with few (55%) that are zero 
(n = 200, K = 100 and 55 zero (3 kfi ). 

Zcut 41.44 0.815 11.99 0.097 0.129 10.06 0.853 38.49 0.167 0.408 

svsForwd 34.02 0.753 15.09 0.054 0.191 8.31 0.826 39.39 0.156 0.415 

OLS-hard 41.99 0.791 14.06 0.128 0.145 11.08 0.707 45.31 0.496 0.446 

OLSForwd 26.90 0.612 20.92 0.042 0.258 5.96 0.574 44.64 0.459 0.445 

(B) Large number of covariates with many (74%) that are zero 
(n = 800, K = 400 and 295 zero f3 kfi ). 

Zcut 75.96 0.903 39.62 0.068 0.106 36.67 0.953 72.61 0.055 0.194 

svsForwd 86.81 0.904 41.19 0.130 0.095 24.42 0.926 81.90 0.025 0.216 

OLS-hard 106.74 0.883 58.54 0.279 0.097 45.41 0.706 121.37 0.676 0.255 

OLSForwd 61.09 0.846 49.87 0.046 0.138 9.14 0.303 106.48 0.590 0.259 
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Remark 11. Given the high dimensionality of the simulations, both 
svsForwd and OLSForwd often stopped early and produced models that 
were much too small. To compensate, we slightly altered their definitions. 
For svsForwd, we modified the definition of k F [cf. (16)] to 

k F = min{k - 1 : \Z k>n \ < z ak/2 and |/3£J < C,k = I, . . . , K + 1}, 

where C = 3. In this way, svsForwd stops the first time the null hypothesis 
is not rejected and if the absolute posterior mean is no longer a large value. 
The definition for OLSForwd was altered in similar fashion, but using Z^ )n 

in place of P% n . 

8.1. Results. The simulations revealed several interesting patterns, sum- 
marized as follows: 

1. Zcut beats OLS-hard across all performance categories. It maintains low 
risk, has small FDR values and has good prediction error performance in 
both the near-orthogonal (uncorrelated) and nonorthogonal (correlated) 
X cases. Performance differences between Zcut and OLS-hard become 
more appreciable in the near-orthogonal simulation (B) involving many 
zero coefficients, because this is when the effect of selective shrinkage 
is most pronounced. For example, the OLS-hard misclassifies about 19 
coefficients more on average, and has a FDR more than 4 times larger 
than Zcut's. Large gains are also seen in the correlated case (B). There, 
the OLS-hard misclassifies over 48 more coefficients on average than Zcut 
and its FDR is more than 12 times higher. 

2. It is immediately clear upon comparing svsForwd to OLSForwd that SVS 
is capable of some serious model averaging. These two procedures differ 
only in the way they rank coefficients, so the disparity in their two per- 
formances is clear evidence of SVS's ability to model average. 

3. In the p = simulations, svsForwd is roughly the same as OLS-hard in 
simulation (A) and significantly better in simulation (B). In the corre- 
lated setting, svsForwd is significantly better. Thus, overall svsForwd is 
as good, and in most cases significantly better, than OLS-hard. This sug- 
gests that svsForwd is starting to tap into the oracle property forward 
stepwise has relative to OLS-hard and provides indirect evidence that 
SVS is capable of reducing model uncertainty in finite samples. 

4. It is interesting to note how badly OLSForwd performs relative to OLS- 
hard in simulation (A) when p = 0. In orthogonal designs, OLSForwd is 
equivalent to OLS-hard, but the p = design is only near-orthogonal. 
With only a slight departure from orthogonality, we see the importance 
of a reliable ranking for the coordinates of (3. Note that this effect is less 
pronounced in simulation (B) because of the larger sample size. This is 
because X*X/n I as n — > oo, so simulation (B) should be closer to 
orthogonality. 
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5. While our theory does not cover Zcut's performance in correlated settings, 
it is interesting to note how well it does in the p = 0.9 simulations relative 
to OLS-hard. The explanation for its success here, however, is probably 
different from that for the orthogonal setting. For example, it is possible 
that its performance gains may be mostly due to the use of generalized 
ridge estimators. As is well known, such estimators are much more sta- 
ble than OLS in multicollinear settings. We should also note that while 
Zcut is better than OLS-hard here, its performance relative to the or- 
thogonal simulations is noticeably worse. This is not unexpected though. 
Correlation has the effect of reducing the dimension of the problem. So 
performance measurements like TotalMiss and FDR will naturally be less 
favorable. 

APPENDIX: PROOFS 

PROOF of Theorem 2. We start by establishing that 3° is consistent, 
which is part of the conclusion of Theorem 2. First observe that 

P° n = n- l V~ l X t Y = (3 + A n , 

where A n = S~ 1 X*e/n and e = (e\, . . . ,£ n )*- From E(A n ) = and Var(A n ) = 
OqS" 1 /?!, it is clear that f3 n (3 . Next, a little bit of rearrangement shows 
that 

S;( 7 , v 2 ) = (i - (a- 2 A^x*x + r- 1 )- 1 r~ 1 )3;. 

Consequently, 

K = K - J (<r- 2 A^ 1 X*X + r- 1 )- 1 ^ 1 ^ (vr x /i)(d 7 , da 2 \Y*) 

= K-Kl ^V^r-^vrx/xX^^lY*), 

where A* = X n /n and V n = S n + <r 2 A*r^ 1 . By the Jordan decomposition 
theorem, we can write V n = J2k=i e k,n^k,n^k n , where {vfc jn } is a set of or- 
thonormal eigenvectors with eigenvalues {efc ifl }. For convenience, assume 
that the eigenvalues have been ordered so that ei >n < ■•■ < ex,n- The as- 
sumption that S„ — ► So, where So is positive definite, ensures that the 
minimum eigenvalue for S n is larger than some eo > for sufficiently large 
n. Therefore, if n is large enough, 

ei, n > eo + cr 2 A* min7r 1 > e > 0. 

k 

Notice that 

HVr-^H 2 = E e^in^K) 2 < e» 2 \\K\\ 2 E Ik 2 - 

k=l k=l 
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Thus, since 7^ > 770 over the support of tt, and a 2 < Sq over the support of 

"a^-'T-'X (7TX^)(dj,da 2 \Y*) 

( K \ 1/2 

<e^\\Wn\\ E/ <T 4 %\*xri(dj,da 2 \Y*)\ 



. fc=i 



Deduce that 0*= 3°+ O p ( A;) ^/3 . □ 

Before proving Theorem 3, we state a lemma. This will also be useful in 
the proofs of some later theorems. 

Lemma A.l. Assume that for each n, e n \, . . . ,e nn are independent ran- 
dom variables such that E(e„j) = 0, E(e^) = Oq and E(e^) < M for some 
finite M. If (D1)-(D4) hold, then 



(18) rT^y^en = n- 1 ' 2 E W ^ N(0, o%V ), 

where e n = (e nl , ...,e T 



i=l 



-III! J 



PROOF. Let S n = Ya=i ^ni^t-l '\fn, where t 6 R K is some arbitrary nonzero 
vector. Let s 2 n = a 2 ] £ t '£ n £ and define C, n i = n _1 / 2 £ n jX*.£/ ' s n . Then, S n /s n = 
Sr=iCm! where Cni are independent random variables such that E(£ n j) = 
and X)r=i^(Cn«) = To prove (18), we will verify the Lindeberg condition 

n 

E E (CniI{ICni| > 5}) -+ for each 5 > 0, 

i=l 

where I(-) denotes the indicator function. This will show that S n /s n ~~> 
N(0, 1), which by the Cramer- Wold device implies (18) because s 2 — > Oq^YiqI. 
Observe that 

HCM\Cni\ > 5}) = ^^E(e 2 m I{\em\ > r ni s n 8}), 
nsi 

where r n i = -y/n/lx-^l. By the Cauchy-Schwarz inequality and the assump- 
tion of a bounded fourth moment for e n j, 

E(e 2 ni I{\e ni \ > r ni s n 5}) < (E(<4) P{|e m | > r ni s n 5}) 1/2 < r^!^° 
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Bound r n i below by 

r n i > r n := I max |x-£|/Vn) . 

\l<i<n / 

Notice that r n — > oo by the assumption that maxj ||xj H/y^ — > 0. Substituting 
the bound for r n j, and since (x*£) 2 sums to ns 2 /<7o, and s 2 remains bounded 
away from zero since s 2 — > u^Eo^ 

n M l/2 

j2m 2 nM\u>s})< -r-o. 

Proof of Theorem 3. Let (f>(-\m,T 2 ) denote a normal density with 
mean m and variance r 2 . By dividing the numerator and denominator by 
n~ K / 2 Y[i=i4>(Yni\ x i(3o,n), one can show that 

u n {S{f3^C/^i)\Y* n ) = n K ' 2 J IJJ3 £ S( (3, , C/^/n )}L n ( g) „(rf/3) 
1 j u n (S(/3 ,C/y/E)\Y*) nK/*fI{[3eS((3 ,C/vK)}L n (p)u(d(3y 
where 

log(L n (/3)) = -i(/3 - /3 )*S n (/3 - /3 ) + n" 1 / 2 5> ni x*(/3 - /3 ). 

Consider the denominator in (19). By making the change of variables from 
(3 to u = y/n(f3 — (3q), we can rewrite this as 

(20) J I{uG5(0,C)}L n0 (u)/(/3 + n" 1 / 2 u)rfu, 

where log(L n0 (u)) = g n (u) + 0(l/n) and^ n (u) = YZ=i e ni y.\\i/n. TheO(l/n) 
term corresponds to u* S n u/n and is uniform over u € 5(0, C) . Observe that, 
for each 5 > 0, 

nb.(u)| > 5} <^E E M«) 2 <^2-2-ElMI 2 = 0(1), 

i=l i=l 

where the last inequality on the right-hand side follows from the Cauchy- 
Schwarz inequality and from the assumption that maxj HxjU/y^ = o(l). 

Therefore, <7n(u) uniformly over u G S(0,C). Because / is continuous 
[and keeping in mind it remains positive and bounded over 5(0, C)], deduce 
that the log of (20) converges in probability to 

(21) log(/(/3 ))+log(7 1{uG5(0,C)}du 

Meanwhile, for the numerator in (19), make the change of variables from 
(3 to u = \/n((3 — Pi) to rewrite this as 

(22) j E{u€5(0,C)}L„ 1 (u)/(/3 1 +n- 1 / 2 u)du, 
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where 

log(L nl (u)) = -1(0! - /3 )*S (/3i - P ) 

n 

+ n- 1 ' 2 £ e n o4(Pi ~ Po) + 9n(u) + o(l) 
i=i 

uniformly over u G 5(0, C) . Consider the second term on the right-hand 
side of the last expression. Set I = P 1 — /3 . By Lemma A.l, since Oq = 1, it 
follows that 

n 

n -1 / 2 5^e n< xj£^N(0,^Eo£). 

i=l 

Now extract the expressions not depending upon u outside the integral 
in (22), take logs and use g n (u) uniformly over u G 5(0, C) to deduce 
that the log of (22) converges in distribution to 

-i(A-/3 )%(/3i-/9o) 
(23) + ( A - PoY Z + log(/( /3 X )) + log (| I{u G 5(0, C)} du) . 

To complete the proof, take the difference of (23) and (21) and note the 
cancellation of the logs of /I{u G S(0,C)} du. □ 



Proof of Theorem 4. First note that 

Plnho) = (X'X+nT^r^Y* = (S + r - 1 )- 1 S /3 + n- 1 / 2 V- 1 X*e n + o (l), 

where V n = S n + Tq . The o(l) term on the right-hand side is due to 
S n — >S . Also, by Lemma A.l, the second term on the right-hand side 
converges in distribution to (So + Tq ) _1 Z, where Z has a N(0, So) distri- 

"""" * d 

bution. Deduce that P nn ("fo) ~^ QuTo)- ^ 

PROOF of Theorem 5. A little algebra (keeping in mind S n = I) shows 
Pn = y/n(l + Tq ) P n /a n . Hence, recalling the definition (12) for Z^ n , 

n* _ j ^k,n j ^ 
Pfc,n — "fc,0 x — «fc,0 X M' 

where dkfl = 7fc,o/(l + 7fc,o) and the last equality holds because Skk = 1- 
Under the assumption of normality, \^nf3^, n has a N(mfc in ,o"o) distribution, 
where mk >n = y/nflkfl- Choose 7 such that dkfl = Si for each k G and 
dk,Q = $2 for each k G i^j, where < 81,62 < 1 are values to be specified. 
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Therefore, 

= (K- A;o)(P{|N(0,a 2 )| > a n z a/2 } - P{|N(0, a 2 )! > 5^a n z a/2 }) 

+ ]T (F{\N(m kjTl ,al)\ <a n z a/2 } - P{|N(m fcjn , cr 2 ,)) < 5 2 l d n z a/2 }), 

where the P-distributions on the right-hand side correspond to the joint 
distribution for a normal random variable and the distribution for a n , where 
(T^/cJq has an independent ^-distribution with n — K degrees of freedom. It 
is clear that the sum on the right-hand side can be made arbitrarily close to 
zero, uniformly for a G [6, 1 — 5] , by choosing 5 2 close to one, while the first 
term on the right-hand side remains positive and uniformly bounded away 
from zero over a 6 [5, 1 — 5] whatever the choice for Si. Thus, for a suitably 
chosen S 2 , Mo{o) — Mz(a) > for each a £ [5,1 — 5]. □ 

Proof of Theorem 6. Choose some j € Let Aj = {7 : dj < 1 - 5}, 
where dj = 7j/(l + 7?)- To prove part (a), we show that 

X4./(Y*| 7 )7r(d7) p 

„ (a _ JA 3 v p ; n 

By definition, /(Y*| 7 ) = J7(Y*|/3)/(/3| 7 ) d/3, where 
/(Y*|/3)/(/3| 7 ) = Cexp(-i-(Y* - X/3)'(Y* - X/3) - ^T' 1 ^ \V\~ 1 / 2 

and C is a generic constant not depending upon 7. By some straightforward 
calculations that exploit conjugacy and orthogonality, 



-1/2 



(24) /(Y*|7) = Cex P (|^4^,n) II C 1 + ^ 

V k=l ) k=l 

where • • ■ , = S^n-^x'Y. 

Let B = {7 : 1 - 5 k <d k <l- 5 k /2, k = l,..., K}, where < 6 k < 1 are 
small values that will be specified. Observe that 

X 4 ./(Y*| 7 )7r(d7) 

Over the set Aj we have the upper bound 

/(Y*|7)<Cexp(l(^e^+ E **,» + (! 
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because < d& < 1, while over B we have the lower bound 



/(Y*| 7 )>Cexpji( £ (l-^ n + (l-^.)^|n(|) ^ 

An application of Lemma A.l (which also applies to nontriangular arrays) 
shows 

(6,n, • • • ,6^ = £nV /2 A) + O p (l)). 

Therefore, 

vr„(A i |Y*)<expjo p (l) + -^( £ 5 fe /% 
(25) 1 n Vfc e«o-0'} 



AH 



7.u ■ ^ • 

Choose <5j < 5. It is clear we can find a set of values {5k j} chosen small 
enough so that 

E S k pl + (5 3 -5)(3l <0. 
ke,%-{j} 

This ensures that the expression in the exponent of (25) converges to — oo 
in probability. Note for this result we assume a\ has a nonzero limit (we 
give a proof shortly that a\ <Tq). Therefore, since tt(B) must be strictly 
positive for small enough 5k > (by our assumptions regarding the support 

for 7r), conclude from (25) that ir n (Aj\Y*) 0. 

To prove part (b), let fk{lk\w) denote the density for 7^ given w. From (4), 
it is seen that fk{lk\w) = (1 - w)go(lk) +wgi{lk)- Therefore, 

fk{lk\w) oc f(y\i)fh(lk\w) 

ocexp(±4^ n )(l + 7fc)~ 1/2 /fc(7fc|^), 
which is the expression (13). Furthermore, by Lemma A.l deduce that £k,n 

converges to a standard normal if (3k,o = (we are using Oq, which still 
needs to be proven). 

To complete the proof, we now show o\ is consistent. For this proof we do 
not assume orthogonality, only that X! n is positive definite (this generality 
will be useful for later proofs). Observe that o\ = (e t e — e'He)/(n — K), 
where H = X(X*X) _1 X*. It follows from Chebyshev's inequality using the 
moment assumptions on £j that e t e/(n — K) o~q, while from Markov's 
inequality, for each 5 > 0, 

n«n« > (» - m < -feM - ^("gf )) - *sL _ 0. 

(n — A Jo (?t, — A Jd (n — Ajo 
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Deduce that a\. □ 

Proof of Theorem 7. Under the assumption of orthogonality, and 
using the fact that X n = n, it follows that 

3^(7, ° 2 ) = <^ V/ 2 D/3 + fr^n-^DX'e, 

where D is the diagonal matrix diag(di, . . . ,cIk) and dk = 7fc/(7fc + c 2 )- 
Taking expectations with respect to the posterior, deduce that 

(26) M,n = K l n 1/2 dWk, + K^Uk^n, 

where d^ = E(cZfc|Y*) and Cfc,n is the kth. coordinate of X*e/-y/n. From Lemma A.l 

and a 2 a 2 , (proven in Theorem 6), we have a~ 1 X t e/- v /n ~~> N(0, 1). There- 
fore, because < d£ < 1, deduce that the second term on the right-hand 
side of (26) is O p (l). Now consider the first term on the right-hand side 
of (26). By our assumptions regarding the support of tt and fi, we must have 
d% > f]o/{rio + Sq). Thus, d* k remains bounded away from zero in probability. 
Hence, because C n / 1 ^fn — >0, we have proven that 



0, if k£& , 

oo, otherwise. Q 



Proof of Theorem 8. As the proof is somewhat lengthy, we first give 
a brief sketch. The basis for the proof will rely on the following result: 

( 0, if k < k < K, 

(27) P° k [k]^\Pkofi, if k = k , 

l/3fc,o + A fc)0 , if l<k<k , 

where Afc i0 is the kth coordinate of So[fc : A;] -1 So [A; : — k]P [— k]. Here (3 Q [—k] - 
( A),fc+i, • • • > Po,kY, while Eo[fc : and So [k '■ —k] are the kx k and A x [K — 
k) matrices associated with So which has been partitioned according to 

H [k:k] E [jfe:-jfe] 
H [-k:k] H [-k:-k] 

First consider what (27) implies when k < k®. Recall the definition (15) 
for Zk t n- Using a 2 Cq (shown in the proof of Theorem 6) and that 
converges to the kth diagonal value of So[/c : fc] -1 , a strictly positive value, 
deduce from the second limit of (27) that P{|Zfc n \ > z ak / 2 } - > 1- Thus, for 
(a), 

¥{k B = k}= n\Z k ,n\ > z* k /2 and \Z jin \ < z aj/2 for j = k + 1, . . . ,K} 

<n\z ko , n \<z ako/2 }^o. 
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For (b), observe that A^o = (by our assumption of orthogonality). Thus, 
when k<ko, the last two lines of (27) imply that P{|Zfc )n | > z ak / 2 } — ¥ 1) and 
therefore, 

F{k F = k - 1} = P{|^fc, n | < z ak/2 and \Z^ n \ > z a . /2 for j = 1, . . . , k - 1} 

<P{\Z k ,n\ <Z ak/2 }^0. 

Now for (c), due to orthogonality, Zk )U = -Zfc.n for Zfc >n defined by (12). Thus, 
Hko > k } > n\Zj, n \ > z aj /2 for j = 1, . . . , k } 1. 

Thus, for all three estimators the probability of the event {k < /co} tends to 
zero. 

Now consider when k > ko- We will show for (a) [and, therefore, for 
(b) and (c)] 

(28) Z n = (Z ko+lin , . . ..Zk^) 1 N(Ojf_ fco ,I), 

which implies {Zk +i, n , ■ ■ ■ , Zk,u\ are asymptotically independent. By (28), 

K 

P{fc fl = fc}^P{|N(0,l)|>* afc/2 } J] P{|N(0,l)|<z aj/2 }, 

j=k+i 

which is the third expression in (a). For (b), using (28) and the assumed 
orthogonality, 

k 

F{k F = k}^F{mO,l)\<z ak+l/2 } P{|N(0,l)|>z Qj/2 }. 

j=k +i 

Meanwhile, for OLS-hard, when k > ko or k = ko, 

P{fco = fc}-pJ Y, K\Z j \>z a . /2 } = k-k \, 

where {Z)- 0+ i, . . . , Zk } are mutually independent N(0, 1) variables. This is 
the second expression in (c). Deduce that (a), (b) and (c) hold (the case 
k = ko for ks and kp can be worked out using similar arguments). 

This completes the outline of the proof. Now we must prove (27) and (28). 
We start with (27). Let Po[k] = (/3o,i> • • • >A),fc)*- Some simple algebra shows 
that 

, . 3°[fc] =p [k] + (X[fc]*X[fc])- 1 X[fc]*X[-fc]i9o[-*] 

1 ' + (X.[k] t X[k])- 1 X.[k] t e, 
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where X[— k\ refers to the design matrix which excludes the first k columns 
of X. It is easy to show that the third term on the right-hand side is o p (l). 
Thus, it follows that 

3° [k] O [k] +V [k:kr 1 -£ [k:- k)P [—k] , 

which is what (27) asserts. 

Finally, we prove (28). By (29), /3%[k] is the fcth coordinate of (Xf/cJ'X^])- 1 x 
X[/c]*e when k> ko, and thus, 

Z M = (Oti,(^M)- 1/2 )(X[fc]*X[fc]/n)- 1 (^n 1 ^ 1/2 X[fc]*e) 
= (v{,O t K _ k )Z n , 

where £ n = <t~ X*e/-\/ra and is the fc-dimensional vector defined by 

v£ = (OU, ( aw [fc])- 1 / 2 )(X[fc]*X[fc]/n)- 1 . 
This allows us to write Z n = V n £ n , where 

( v fc +l'^-fco-l) 
v K J 

Thus, because £ n N(0, So) by Lemma A.l, we have 

Z n ^N(0^_ feo ,VoEoV^, 

where Vo is the limit of V n . In particular, V n S n V^ — > VoSoVq. To complete 
the proof, we show Vo^oVq = I by proving that V ra S n V n = I. Note by 
tedious (but straightforward) algebra that v[S n V{. = 1. Consider v*I] n Vfc 
when j / k and j > ko. By (29), when k > ko, 

P° k [k}=n~ 1 Skk [k} 1 / 2 v{X t e. 

By Remark 8 we know that P ko+ i[ko + 1], . . . ,j3° K [K] are uncorrelated. Thus 
E 0j \j\h\ k \) = 0iij^k, and therefore, 

= E(v*X t ev' fe X*e) = v*X i E(ee t )Xv fc = CTgv*X*Xv fe . 

Thus, v*S n v fc = 0. Deduce that V n S n V^ = I and, hence, that V S Vq = I. 
□ 



v fc +l 
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SVS Gibbs sampler. 

Algorithm. The SVS procedure uses a Gibbs sampler to simulate pos- 
terior values 

(f3,J,r,w,a 2 \Y*) 

from (6), where J = (J*!, . . . , ^k) 1 and r = (n, . . . , tk) '■ Recall that 7^ = 
J^r|, so simulating J and r provides a value for 7. The Gibbs sampler 
works as follows: 

1. Simulate ((3\~f,cr 2 , Y*) ~N(/i,cr 2 S), the conditional distribution for (3, 
where 

/x = SX'Y* and £ = (X*X + ff 2 nr -1 ) _1 . 

2. Simulate horn its conditional distribution 

(S k \0, r, «,) ^ d Wl ; k 5 V0 {.) + ^ ft(-), k = 1, . . . ,K, 

where 

Pi 



m,k = (1 - w)v Q 1/2 exp ^~ 



2vot£ 



and 



w 2 ,fc = -u;exp^ - — -2 

3. Simulate 7)7 2 from its conditional distribution, 

(r fc ~ 2 |/3, J) W Gamma L + i a 2 + ^ , k = l,...,K. 

4. Simulate iu, the complexity parameter, from its conditional distribution, 

H 7 ) ~ Beta(l + : 7fe = 1}, 1 + #{k : lk = v }). 

5. Simulate cr~ 2 from its conditional distribution, 

(<r- 2 |/3, Y*) ~ Gammaffti + J,&2 + ^-||Y* - X/3|| 2 Y 
V 2 in / 

6. This completes one iteration. Update 7 by setting 7^ = J^r| for k = 
1,...,K. 

Computations for large K. The most costly computation in run- 
ning the Gibbs sampler is the inversion 
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required in updating (3 in step 1. This requires 0(K 3 ) operations and can 
be tremendously slow when K is large. 

A better approach is to update j3 in B blocks of size q. This will reduce 
computations to order 0(B~ 2 K 3 ), where K = Bq. To proceed, decompose 
(3 as (^ 1) ,...,/3| B) ) t , T as diag(r (1) , . . . ,r (s) ) and X as [X (1) , . . . , X (B) ]. 
Now update each component P(j), j = 1,...,B, conditioned on the remaining 
values. Using a subscript — (j) to indicate exclusion of the jth component, 
draw f}(j\ from a N(/ij, cr 2 Tij) distribution, where 

t*i = VjXt^Y* - X_ w /3_ w ) and S j = (XfoXyj + a 2 nT^y\ 

Notice that the cross-product terms X/^Xy) and Xy\X_^ can be ex- 
tracted from X*X and do not need to be computed. 

Acknowledgments. The authors thank Benedikt Potscher and Hannes 
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